1 - 8 of 8
1.
J Biomed Inform ; 152: 104626, 2024 04.
Article En | MEDLINE | ID: mdl-38521180

OBJECTIVE: The accuracy of deep learning models for many disease prediction problems is affected by time-varying covariates, rare incidence, covariate imbalance and delayed diagnosis when using structured electronic health records data. The situation is further exacerbated when predicting the risk of one disease on condition of another disease, such as the hepatocellular carcinoma risk among patients with nonalcoholic fatty liver disease, due to slow, chronic progression, the scarcity of data with both disease conditions and the sex bias of the diseases. The goal of this study is to investigate the extent to which the aforementioned issues influence deep learning performance, and then to devise strategies to tackle these challenges. These strategies were applied to improve hepatocellular carcinoma risk prediction among patients with nonalcoholic fatty liver disease. METHODS: We evaluated two representative deep learning models in the task of predicting the occurrence of hepatocellular carcinoma in a cohort of patients with nonalcoholic fatty liver disease (n = 220,838) from a national EHR database. The disease prediction task was carefully formulated as a classification problem while taking censoring and the length of follow-up into consideration. RESULTS: We developed a novel backward masking scheme to deal with the issue of delayed diagnosis, which is very common in EHR data analysis, and to evaluate how the length of longitudinal information after the index date affects disease prediction. We observed that modeling time-varying covariates improved the performance of the algorithms and that transfer learning mitigated the reduced performance caused by the lack of data. In addition, covariate imbalance, such as sex bias in the data, impaired performance. Deep learning models trained on one sex and evaluated in the other sex showed reduced performance, indicating the importance of assessing covariate imbalance while preparing data for model training. 
CONCLUSIONS: The strategies developed in this work can significantly improve the performance of hepatocellular carcinoma risk prediction among patients with nonalcoholic fatty liver disease. Furthermore, our novel strategies can be generalized to apply to other disease risk predictions using structured electronic health records, especially for disease risks on condition of another disease.


Carcinoma, Hepatocellular , Deep Learning , Liver Neoplasms , Non-alcoholic Fatty Liver Disease , Humans , Carcinoma, Hepatocellular/diagnosis , Carcinoma, Hepatocellular/epidemiology , Non-alcoholic Fatty Liver Disease/complications , Non-alcoholic Fatty Liver Disease/diagnosis , Non-alcoholic Fatty Liver Disease/epidemiology , Liver Neoplasms/diagnosis , Liver Neoplasms/epidemiology , Electronic Health Records
2.
medRxiv ; 2023 Nov 17.
Article En | MEDLINE | ID: mdl-38014193

Background: Deep learning models showed great success and potential when applied to many biomedical problems. However, the accuracy of deep learning models for many disease prediction problems is affected by time-varying covariates, rare incidence, and covariate imbalance when using structured electronic health records data. The situation is further exacerbated when predicting the risk of one disease on condition of another disease, such as the hepatocellular carcinoma risk among patients with nonalcoholic fatty liver disease, due to slow, chronic progression, the scarcity of data with both disease conditions and the sex bias of the diseases. Objective: The goal of this study is to investigate the extent to which time-varying covariates, rare incidence, and covariate imbalance influence deep learning performance, and then to devise strategies to tackle these challenges. These strategies were applied to improve hepatocellular carcinoma risk prediction among patients with nonalcoholic fatty liver disease. Methods: We evaluated two representative deep learning models in the task of predicting the occurrence of hepatocellular carcinoma in a cohort of patients with nonalcoholic fatty liver disease (n = 220,838) from a national EHR database. The disease prediction task was carefully formulated as a classification problem while taking censoring and the length of follow-up into consideration. Results: We developed a novel backward masking scheme to evaluate how the length of longitudinal information after the index date affects disease prediction. We observed that modeling time-varying covariates improved the performance of the algorithms and that transfer learning mitigated the reduced performance caused by the lack of data. In addition, covariate imbalance, such as sex bias in the data, impaired performance. 
Deep learning models trained on one sex and evaluated in the other sex showed reduced performance, indicating the importance of assessing covariate imbalance while preparing data for model training. Conclusions: Devising proper strategies to address challenges from time-varying covariates, lack of data, and covariate imbalance can be key to counteracting data bias and accurately predicting disease occurrence using deep learning models. The novel strategies developed in this work can significantly improve the performance of hepatocellular carcinoma risk prediction among patients with nonalcoholic fatty liver disease. Furthermore, our novel strategies can be generalized to apply to other disease risk predictions using structured electronic health records, especially for disease risks on condition of another disease.

3.
Brief Bioinform ; 24(4)2023 07 20.
Article En | MEDLINE | ID: mdl-37337757

The T-cell receptor (TCR) repertoire is highly diverse among the population and plays an essential role in initiating multiple immune processes. TCR sequencing (TCR-seq) has been developed to profile the T cell repertoire. Similar to other high-throughput experiments, contamination can happen during several steps of TCR-seq, including sample collection, preparation and sequencing. Such contamination creates artifacts in the data, leading to inaccurate or even biased results. Most existing methods assume 'clean' TCR-seq data as the starting point with no ability to handle data contamination. Here, we develop a novel statistical model to systematically detect and remove contamination in TCR-seq data. We summarize the observed contamination into two sources, pairwise and cross-cohort. For both sources, we provide visualizations and summary statistics to help users assess the severity of the contamination. Incorporating prior information from 14 existing TCR-seq datasets with minimum contamination, we develop a straightforward Bayesian model to statistically identify contaminated samples. We further provide strategies for removing the impacted sequences to allow for downstream analysis, thus avoiding any need to repeat experiments. Our proposed model shows robustness in contamination detection compared with a few off-the-shelf detection methods in simulation studies. We illustrate the use of our proposed method on two TCR-seq datasets generated locally.


Receptors, Antigen, T-Cell , T-Lymphocytes , Humans , Bayes Theorem , Receptors, Antigen, T-Cell/genetics , Models, Statistical , High-Throughput Nucleotide Sequencing/methods
4.
Bioinform Adv ; 3(1): vbad029, 2023.
Article En | MEDLINE | ID: mdl-36998720

Motivation: Cell label annotation is a challenging step in the analysis of single-cell RNA sequencing (scRNA-seq) data, especially for tissue types that are less commonly studied. The accumulation of scRNA-seq studies and biological knowledge leads to several well-maintained cell marker databases. Manually examining the cell marker lists against these databases can be difficult due to the large amount of available information. Additionally, simply overlapping the two lists without considering gene ranking might lead to unreliable results. Thus, an automated method with careful statistical testing is needed to facilitate the usage of these databases. Results: We develop a user-friendly computational tool, EasyCellType, which automatically checks an input marker list obtained by differential expression analysis against the databases and provides annotation recommendations in graphical outcomes. The package provides two statistical tests, gene set enrichment analysis and a modified version of Fisher's exact test, as well as customized database and tissue type choices. We also provide an interactive shiny application to annotate cells in a user-friendly graphical user interface. The simulation study and real-data applications demonstrate favorable results by the proposed method. Availability and implementation: https://biostatistics.mdanderson.org/shinyapps/EasyCellType/; https://bioconductor.org/packages/devel/bioc/html/EasyCellType.html. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
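As an illustration of the overlap testing mentioned in the abstract above, the following is a minimal sketch of a standard one-sided Fisher's exact (hypergeometric) test for the overlap between an input marker list and one reference cell-type marker set. It is not EasyCellType's modified test, which the abstract does not specify; the function name, the toy gene lists and the universe size are all hypothetical.

```python
from math import comb


def fisher_exact_enrichment(markers, reference, universe_size):
    """One-sided Fisher's exact test for marker-list overlap.

    Returns (overlap, p_value), where p_value is the probability of seeing
    at least `overlap` shared genes if `markers` were drawn at random from
    a universe of `universe_size` genes. This is the standard test, not the
    modified version used by EasyCellType.
    """
    markers, reference = set(markers), set(reference)
    k = len(markers & reference)   # observed overlap
    n = len(markers)               # size of the input marker list
    K = len(reference)             # size of the reference marker set
    N = universe_size              # total number of genes considered

    total = comb(N, n)
    # Upper tail of the hypergeometric distribution: P(X >= k)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return k, tail / total


# Toy example (hypothetical gene lists, universe of 20 genes)
input_markers = ["CD3D", "CD3E", "CD2", "MS4A1"]
reference_set = ["CD3D", "CD3E", "CD2", "CD5"]
overlap, p_value = fisher_exact_enrichment(input_markers, reference_set, 20)
```

A plain overlap test like this ignores gene ranking, which is the limitation the abstract raises as motivation for also offering gene set enrichment analysis.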

5.
Hosp Pediatr ; 12(11): 1011-1019, 2022 11 01.
Article En | MEDLINE | ID: mdl-36263712

BACKGROUND AND OBJECTIVES: Molecular diagnostics provide a rapid and sensitive diagnosis of gastroenteritis compared with a stool culture. In this study, we seek to describe the changes in medical management and outcomes of children with Salmonella gastroenteritis as our hospital system adopted molecular diagnostics. METHODS: This study is a retrospective chart review of children <18 years of age diagnosed with nontyphoidal Salmonella gastroenteritis between 2008 and 2018 at a large pediatric health care system in the southeastern United States. Those with immunocompromising conditions and hemoglobinopathies were excluded. Patients diagnosed via molecular testing were compared with those diagnosed solely by stool culture for aspects of management including admission rates, blood culture obtainment, and antibiotic administration. RESULTS: Of 965 eligible patients with Salmonella gastroenteritis, 264 (27%) had a stool molecular test and 701 (73%) only had a stool culture performed. Groups were similar in age and presentation. Those diagnosed by molecular methods had higher hospitalization rates (69% vs 50%, P <.001), more blood cultures obtained (54% vs 44%, P <.01), and received more antibiotics (49% vs 34%, P <.001) despite statistically similar rates of bacteremia (11% vs 19%, P = .05). CONCLUSIONS: The rapid diagnosis of Salmonella gastroenteritis by molecular methods was associated with increased hospital admission rates, blood culture obtainment, and antibiotic use. This suggests possible overmedicalization of uncomplicated Salmonella gastroenteritis, and clinicians should remain cognizant of the possibility of providing low-value care for uncomplicated disease.


Gastroenteritis , Salmonella , Child , Humans , Infant , Salmonella/genetics , Retrospective Studies , Gastroenteritis/diagnosis , Gastroenteritis/therapy , Anti-Bacterial Agents/therapeutic use , Molecular Diagnostic Techniques
6.
Hosp Pediatr ; 12(7): e225-e229, 2022 07 01.
Article En | MEDLINE | ID: mdl-35726559

BACKGROUND AND OBJECTIVE: The optimal duration of intravenous (IV) antibiotic therapy for children with nontyphoidal Salmonella bacteremia (NTSB) is unknown. The objective of this study is to evaluate differences in outcomes among children with NTSB who received a short (≤3 days; short-duration group [SDG]) versus long (>3 days; long-duration group [LDG]) course of IV antibiotics. METHODS: This is a retrospective study of children 3 months to 18 years old with NTSB admitted to a tertiary pediatric health care system in the southeastern United States between 2008 and 2018. RESULTS: Among 57 patients with NTSB without focal infection, 24 (42%) were in the SDG and received IV antibiotics for a median of 3.0 days and 33 (58%) were in the LDG and received IV antibiotics for a median of 5.0 days. Demographic and clinical characteristics were similar between the SDG and LDG. The median total duration of antibiotics was 11.5 days in the SDG and 13.0 in the LDG (P = .068). The median length of stay was 3.0 days in the SDG and 4.0 in the LDG (P ≤ .001). Two children in the SDG (8%) and 1 child in the LDG (3%) returned to the emergency department for care unrelated to the duration of their IV antibiotic therapy (P = .567). None of the children were readmitted for sequelae related to salmonellosis. CONCLUSIONS: The duration of IV antibiotics varied for NTSB, but the outcomes were excellent regardless of the initial IV antibiotic duration. Earlier transitions to oral antibiotics can be considered for NTSB.


Anti-Bacterial Agents , Bacteremia , Administration, Intravenous , Anti-Bacterial Agents/therapeutic use , Bacteremia/drug therapy , Child , Humans , Retrospective Studies , Salmonella
7.
Bioinformatics ; 38(8): 2096-2101, 2022 04 12.
Article En | MEDLINE | ID: mdl-35176131

MOTIVATION: Cross-sectional analyses of primary cancer genomes have identified regions of recurrent somatic copy-number alteration, many of which result from positive selection during cancer formation and contain driver genes. However, no effective approach exists for identifying genomic loci under significantly different degrees of selection in cancers of different subtypes, anatomic sites or disease stages. RESULTS: CNGPLD is a new tool for performing case-control somatic copy-number analysis that facilitates the discovery of differentially amplified or deleted copy-number aberrations in a case group of cancer compared with a control group of cancer. This tool uses a Gaussian process statistical framework in order to account for the covariance structure of copy-number data along genomic coordinates and to control the false discovery rate at the region level. AVAILABILITY AND IMPLEMENTATION: CNGPLD is freely available at https://bitbucket.org/djhshih/cngpld as an R package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Genome , Neoplasms , Humans , Cross-Sectional Studies , Genomics , DNA Copy Number Variations , Neoplasms/genetics , Case-Control Studies , Software
8.
Front Immunol ; 12: 691216, 2021.
Article En | MEDLINE | ID: mdl-34177951

Failure of resolution pathways in periodontitis is reflected in levels of specialized pro-resolving lipid mediators (SPMs) and SPM pathway markers but their relationship with the subgingival microbiome is unclear. This study aimed to analyze and integrate lipid mediator level, SPM receptor gene expression and subgingival microbiome data in subjects with periodontitis vs. healthy controls. The study included 13 periodontally healthy and 15 periodontitis subjects that were evaluated prior to or after non-surgical periodontal therapy. Samples of gingival tissue and subgingival plaque were collected prior to and 8 weeks after non-surgical treatment; only once in the healthy group. Metabololipidomic analysis was performed to measure levels of SPMs and other relevant lipid mediators in gingiva. qRT-PCR assessed relative gene expression (2^-ΔΔCT) of known SPM receptors. 16S rRNA sequencing evaluated the relative abundance of bacterial species in subgingival plaque. Correlations between lipid mediator levels, receptor gene expression and bacterial abundance were analyzed using the Data Integration Analysis for Biomarker discovery using Latent cOmponents (DIABLO) and Sparse Partial Least Squares (SPLS) methods. Profiles of lipid mediators, receptor genes and the subgingival microbiome were distinct in the three groups. The strongest correlation existed between lipid mediator profile and subgingival microbiome profile. Multiple lipid mediators and bacterial species were highly correlated (correlation coefficient ≥0.6) in different periodontal conditions. Comparing individual correlated lipid mediators and bacterial species in periodontitis before treatment to healthy controls revealed that one bacterial species, Corynebacterium durum, and five lipid mediators, 5(S)6(R)-DiHETE, 15(S)-HEPE, 7-HDHA, 13-HDHA and 14-HDHA, were identified in both conditions. 
Comparing individual correlated lipid mediators and bacterial species in periodontitis before treatment to after treatment revealed that one bacterial species, Anaeroglobus geminatus, and four lipid mediators, 5(S)12(S)-DiHETE, RvD1, Maresin 1 and LTB4, were identified in both conditions. Four Selenomonas species were highly correlated with RvD1, RvE3, 5(S)12(S)-DiHETE and proinflammatory mediators in the periodontitis after treatment group. Profiles of lipid mediators, receptor genes and the subgingival microbiome are associated with periodontal inflammation and correlated with each other, suggesting inflammation mediated by lipid mediators influences microbial composition in periodontitis. The role of correlated individual lipid mediators and bacterial species in periodontal inflammation has to be further studied.
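
For readers unfamiliar with the relative expression measure named in this abstract, the standard comparative CT calculation is commonly written as below. The choice of reference (housekeeping) gene and of calibrator group are assumptions here, since the abstract does not state which were used.

\[
\begin{aligned}
\Delta C_T &= C_T^{\text{target}} - C_T^{\text{reference}} \\
\Delta\Delta C_T &= \Delta C_T^{\text{sample}} - \Delta C_T^{\text{calibrator}} \\
\text{relative expression} &= 2^{-\Delta\Delta C_T}
\end{aligned}
\]

Under this convention, and assuming roughly 100% amplification efficiency, a value of 1 means expression equal to the calibrator and a value of 2 means a two-fold increase.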


Gingiva/metabolism , Gingiva/microbiology , Lipid Metabolism , Metabolome , Microbiota , Periodontitis/metabolism , Periodontitis/microbiology , Adaptor Proteins, Signal Transducing/genetics , Adult , Bacteria/genetics , Female , Humans , Lipids , Male , Middle Aged , RNA, Ribosomal, 16S/genetics , Receptors, Chemokine/genetics , Receptors, G-Protein-Coupled/genetics , Receptors, Leukotriene B4/genetics , Young Adult
...